
Applications in Natural Language Processing

The activations are binarized following previous works as:

$$
\hat{X}^i_B = \mathrm{Sign}(X^i_R) =
\begin{cases}
-1, & \text{if } X^i_R < 0 \\
+1, & \text{if } X^i_R \geq 0
\end{cases}
\tag{5.37}
$$

In that case, $\hat{X}_B^T \hat{X}_B = n_{X_R}$, where $n_{X_R}$ is the number of elements in $X_R$, and $\alpha$ can be solved as:

$$
\alpha = \frac{X_R^T \hat{X}_B}{n_{X_R}} = \frac{\|X_R\|_{\ell 1}}{n_{X_R}}
\tag{5.38}
$$
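As a concrete illustration, the sign binarization of Eq. (5.37) together with the analytic scale of Eq. (5.38) can be sketched in a few lines of numpy. This is not the authors' code; the helper name `binarize_sign` is hypothetical.

```python
import numpy as np

def binarize_sign(x_r):
    """Sign binarization (Eq. 5.37) with the analytic scale of Eq. (5.38).

    Returns the binary tensor in {-1, +1} and the scalar alpha
    that best rescales it back toward x_r in the L2 sense.
    """
    x_b = np.where(x_r >= 0, 1.0, -1.0)   # Sign(x), with Sign(0) = +1
    alpha = np.abs(x_r).sum() / x_r.size  # ||x_r||_l1 / n_{X_R}
    return x_b, alpha

x = np.array([-0.8, 0.2, 0.0, 1.4])
x_b, alpha = binarize_sign(x)
# x_b -> [-1, 1, 1, 1]; alpha = (0.8 + 0.2 + 0.0 + 1.4) / 4 = 0.6
```

The reconstruction $\alpha \hat{X}_B$ then approximates $X_R$ with a single shared scale per tensor.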

For the activations in attention layers or after ReLU non-linearity layers, where $X_R \in \mathbb{R}^n_+$, the authors binarize the activations to $\hat{X}_B \in \{0, 1\}^n$ by rounding the real-valued activations:

$$
\hat{X}^i_B = \mathrm{Clip}(X^i_R, 0, 1) =
\begin{cases}
0, & \text{if } X^i_R < 0.5 \\
1, & \text{if } X^i_R \geq 0.5
\end{cases}
\tag{5.39}
$$

In that case, $\hat{X}_B^T \hat{X}_B = n_{\{X_R \geq 0.5\}}$, where $n_{\{X_R \geq 0.5\}}$ denotes the number of elements in $X_R$ that are greater than or equal to 0.5. Then $\alpha$ can be solved as:

$$
\alpha = \frac{\|X_R \cdot \mathbb{1}_{\{X_R \geq 0.5\}}\|_{\ell 1}}{n_{\{X_R \geq 0.5\}}}
\tag{5.40}
$$
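The $\{0, 1\}$ variant of Eqs. (5.39)–(5.40) admits an equally short numpy sketch. Again, this is only an illustration under the stated formulas, not the authors' implementation; `binarize_01` is a hypothetical name.

```python
import numpy as np

def binarize_01(x_r):
    """{0, 1} binarization (Eq. 5.39) with the scale from Eq. (5.40)."""
    x_b = (x_r >= 0.5).astype(np.float64)  # round to {0, 1} at threshold 0.5
    n_pos = x_b.sum()                      # n_{X_R >= 0.5}
    # alpha = ||x_r * 1_{x_r >= 0.5}||_l1 / n_{x_r >= 0.5}
    alpha = (x_r * x_b).sum() / n_pos if n_pos > 0 else 0.0
    return x_b, alpha

x = np.array([0.1, 0.6, 0.9, 0.4])
x_b, alpha = binarize_01(x)
# x_b -> [0, 1, 1, 0]; alpha = (0.6 + 0.9) / 2 = 0.75
```

Note that only the entries mapped to 1 contribute to $\alpha$, since the entries mapped to 0 are reconstructed exactly regardless of the scale.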

5.10.2 Elastic Binarization Function

The fixed scaling and threshold derived previously work reasonably well, but they may not be optimal since they ignore the distribution of the variable being binarized. Ideally, these parameters can be learned during training to minimize the target loss.

When using classical binarization methods, i.e., $\hat{X}^i_B = \mathrm{Sign}(X^i_R)$, the binary output is independent of the scale of the real-valued input. However, in our case, where $\hat{X}^i_B = \mathrm{Clip}(X^i_R, 0, 1)$, this independence no longer holds. Learning the scaling and threshold parameters, and precisely approximating the gradients in the process, therefore becomes crucial for the final accuracy.

To handle this, the authors proposed the elastic binarization function to learn both the scale $\alpha \in \mathbb{R}_+$ and the threshold $\beta \in \mathbb{R}$:

$$
X^i_B = \alpha \hat{X}^i_B = \alpha\, \mathrm{Clip}\!\left(\frac{X^i_R - \beta}{\alpha}, 0, 1\right)
\tag{5.41}
$$

In this function, $\alpha$ is initialized with the $\alpha$ from Eq. (5.38) and $\beta$ with 0, and both are trained with gradients from the final loss. To back-propagate the gradients to $\alpha$ through the discretized binarization function, the straight-through estimator (STE) [9] is leveraged to bypass the incoming gradients of the round function as the outgoing gradients:

$$
\frac{\partial X^i_B}{\partial \alpha}
= \hat{X}^i_B + \alpha \frac{\partial \hat{X}^i_B}{\partial \alpha}
\overset{\mathrm{STE}}{\approx} \hat{X}^i_B + \alpha \frac{\partial\, \mathrm{Clip}\!\left(\frac{X^i_R - \beta}{\alpha}, 0, 1\right)}{\partial \alpha}
=
\begin{cases}
0, & \text{if } X^i_R < \beta \\
\frac{\beta - X^i_R}{\alpha}, & \text{if } \beta \leq X^i_R < \frac{\alpha}{2} + \beta \\
1 - \frac{X^i_R - \beta}{\alpha}, & \text{if } \frac{\alpha}{2} + \beta \leq X^i_R < \alpha + \beta \\
1, & \text{if } X^i_R \geq \alpha + \beta
\end{cases}
\tag{5.42}
$$
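A minimal numpy sketch of the elastic binarization forward pass (Eq. 5.41) and the piecewise STE gradient with respect to $\alpha$ (Eq. 5.42) may help make the case analysis concrete. This assumes a single per-tensor $\alpha$ and $\beta$; the helper names are hypothetical and this is not the authors' code.

```python
import numpy as np

def elastic_binarize(x_r, alpha, beta):
    """Elastic binarization forward pass (Eq. 5.41): round-and-clip of
    (x_r - beta) / alpha, rescaled by alpha, so outputs lie in {0, alpha}."""
    # floor(t + 0.5) rounds half up, matching the 0.5 threshold of Eq. (5.39)
    x_hat = np.clip(np.floor((x_r - beta) / alpha + 0.5), 0.0, 1.0)
    return alpha * x_hat

def grad_alpha(x_r, alpha, beta):
    """Piecewise STE gradient dX_B/dalpha from Eq. (5.42)."""
    g = np.zeros_like(x_r)                               # 0 where x_r < beta
    mid_lo = (x_r >= beta) & (x_r < alpha / 2 + beta)
    mid_hi = (x_r >= alpha / 2 + beta) & (x_r < alpha + beta)
    high = x_r >= alpha + beta
    g[mid_lo] = (beta - x_r[mid_lo]) / alpha
    g[mid_hi] = 1.0 - (x_r[mid_hi] - beta) / alpha
    g[high] = 1.0
    return g

x = np.array([-1.0, 0.2, 0.8, 2.0])
out = elastic_binarize(x, alpha=1.0, beta=0.0)  # -> [0, 0, 1, 1]
g = grad_alpha(x, alpha=1.0, beta=0.0)          # -> [0, -0.2, 0.2, 1]
```

Note how the two middle cases of Eq. (5.42) combine the STE pass-through of the clip's linear region with the rounded value of $\hat{X}^i_B$ (0 below the midpoint $\alpha/2 + \beta$, 1 above it).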